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Abstract 

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively 
little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated 
methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting 
balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of 
assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of 
previously-identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find 
evidence for many novel candidates, the strongest of which is FANKl, an imprinted gene that suppresses apoptosis, is 
expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing 
selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our 
methods are able to reproduce many previously-hypothesized signals of balancing selection, as well as discover novel 
interesting candidates. 
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Introduction 

Balancing selection maintains variation within a population. 
Multiple processes can lead to balancing selection. In overdom- 
inance, the heterozygous genotype has higher fitness than either of 
the homozygous genotypes [1,2]. In frequency-dependent balanc- 
ing selection, the fitness of an allele is inversely related to its 
frequency in the population [2,3]. In a fluctuating or spatially- 
structured environment, balancing selection can occur when 
dilferent alleles are favored in different environments over time 
or geography [2,4,5]. Finally, balancing selection can also be a 
product of opposite directed effects of segregation distortion 
balanced by negative selection against the distorter [6]. That is, 
segregation distortion leads to one allele increasing in frequency. 
However, if that allele is deleterious, then it is reduced in 
frequency by negative selection. The combined effect of these 
opposing forces can lead to a balanced polymorphism. 

The genetic signatures of long-term balancing selection at a 
locus can roughly be divided into three categories [2]. The first 
signature is that the distribution of allele frequencies will be 
enriched for intermediate frequency alleles. This occurs because 
the selected locus itself is likely at moderate frequency within the 
population and, thus, neutral linked loci wiU also be at 
intermediate frequency. The second signature is the presence of 
trans-specific polymorphisms, which are polymorphisms that are 
shared among species [7]. This is a result of alleles being 



maintained over long evolutionary time periods, sometimes for 
millions of years [8-10]. The third signature is an increased 
density of polymorphic sites. This is due to linked neutral loci 
sharing similar deep genealogies as that of the selected site, 
increasing the probability of observing mutations at the neutral 
loci. 

The majority of selection scans in humans have focused on 
positive and negative directional selection. These studies have 
found evidence of both types of selection, with negative selection 
being ubiquitous, and the amount and mechanism of positive 
selection currently being debated [11-13]. However, it is unclear 
how much balancing selection exists in the human genome. Some 
scans for balancing selection (e.g., Bubb et al. [14] and Andres 
et al. [15]) have been carried out using summary statistics such as 
the Hudson-Kreitman-Aguade (HKA) test [16] and Tajima's D 
[17] as well as combinations of summary statistics [15,18] (though 
see Segural et al. [7] and Lefiler et al. [19] for recent 
complementary approaches). The power of such approaches in 
unclear, and so it is uncertain how important balancing selection 
is in the human genome. Because balancing selection shapes the 
genealogy of a sample around a selected locus, more power can 
be gained by implementing a model of the genealogical process 
under balancing selection [20,21]. Composite likelihood methods 
have proven to be extremely useful for the analysis of genetic 
variation data using complex population genetic models. [22-28]. 
This approach allows estimation under models without requiring 
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Author Summary 

In the past, balancing selection was a topic of great 
theoretical interest that received much attention. Howev- 
er, there has been little focus toward developing methods 
to identify regions of the genome that are under balancing 
selection. In this article, we present the first set of 
llkellhood-based methods that explicitly model the spatial 
distribution of polymorphism expected near a site under 
long-term balancing selection. Simulation results show 
that our methods outperform commonly-used summary 
statistics for identifying regions under balancing selection. 
Finally, we performed a scan for balancing selection in 
Africans and Europeans using our new methods and 
identified a gene called FANKl as our top candidate 
outside the HLA region. We hypothesize that the mainte- 
nance of polymorphism at FANKl Is the result of 
segregation distortion. 

full likelihood calculations, permitting many complex models to 
be investigated. 

In this article, we develop two composite likelihood ratio 
methods to detect balancing selection, which we denote by Ti and 
T2 . These methods are based on modeling the effect of balancing 
selection on the genealogy at linked neutral loci (e.g., Kaplan et al. 
(1988) [20] and Hudson and Kaplan (1988) [21]) and take into 
consideration the spatial distributions of polymorphisms and 
substitutions around a selected site. Through simulations, we 
show that our methods outperform both HKA and Tajima's D 
under a variety of demographic assumptions. Further, we apply 
our methods to autosomal whole-genome sequencing data 
consisting of nine unrelated European (CEU) and nine unrelated 
African (YRI) individuals. We find support for multiple targets of 
balancing selection in the human genome, including previously 
hypothesized regions such as the human leukocyte antigen (HLA) 
locus. Additionally, we find evidence for balancing selection at the 
FANKl gene, which we hypothesize to result from segregation 
distortion. 

Results 

Theory 

A new test for balancing selection. In this section, we 
provide a basic overview of a new test for balancing selection, and 
we describe the method in greater detail in the sections entitled 
Kaplan-Darden-Hudson model, Solving the recursion relation, A 
composite likelihood ratio test based on polymorphism and substi- 
tution, and A composite likelihood ratio test based on frequency 
spectra and substitutions sections. We have developed a new 
statistical method for detecting balancing selection, which is based 
on the model of Kaplan, Darden, and Hudson [20,2 1] (full details 
provided in the Kaplan-Darden-Hudson model section). Under 
this model, we calculate the expected distribution of allele 
frequencies using simulations, and approximate the probability 
of observing a fixed difference or polymorphism at a site as a 
function of its genomic distance to a putative site under balancing 
selection. Using these calculations, we construct composite 
likelihood tests that can be used to identify sites under balancing 
selection, similar to the approaches by Kim and Stephan [23] and 
Nielsen el al. [26] for detecting selective sweeps. 

Basic framework. Consider a biallelic site S that is under 
strong balancing selection and maintains an allele A \ at frequency 
X and an allele A2 at frequency \ —x. Consider a neutral locus / 
that is linked to the selected locus S. Denote the scaled 



recombination rate between the selected locus and the neutral 
locus as Pi = INr,, where N is the diploid population size and r,- is 
the per-generation recombination rate. Assume we have a sample 
of n genomes from an ingroup species (e.g., humans) and a single 
genome from an outgroup species (e.g., chimpanzee). From these 
data, we can estimate the genome-wide expected coalescence time 
C between the ingroup and outgroup species (see Materials and 
Methods for details). Also, under the Kaplan-Darden-Hudson 
model, we can obtain the expected tree length L„(x,p) and height 
Hn{x,p) for a sample of w lineages affected by balancing selection 
by solving a set of recursive equations using the numerical 
approach described in the Solving the recursion relation. The 
relationship among C, L„(x,p), and H„(x,p) is depicted in 
Figure lA. Assuming a small mutation rate, the probability that a 
site is polymorphic under a model of balancing selection, given 
that it contains either a polymorphism or a substitution (fixed 
difference), is 

Pn,p.x = — , (1) 

2C - H„(x,p) + L„{x,p) 

and the conditional probability that it contains a substitution is 
Sn,p,x = 1 —Pn,p,x- That is, conditional on a mutation occurring on 
the genealogy relating the n ingroup genomes and the outgroup 
genome, the probability that a site is polymorphic is the 
probability that a mutation occurs before the most recent common 
ancestor of the n ingroup species (i.e., mutation occurs on red 
branches indicated in Fig. 15), and the probability that a site 
contains a substitution is the probability that a mutation occurs 
along the branch leading from the outgroup sequence to the most 
recent common ancestor of the n ingroup species [i.e., mutation 
occurs on blue branches indicated in Fig. IC). 

Figure ID shows how die spatial distribution of polymorphism 
around a selected site is influenced by the underlying genealogy at 
the site and how this spatial distribution of polymorphism can be 
used to provide evidence for balancing selection. Within a window 
of sites, we can obtain the composite likelihood that a particular 
site is under selection by multiplying the conditional probability of 
observing a polymorphism or a substitution at every other neutral 
site as a function of the distance of the neutral site to the balanced 
polymorphism. 

Kaplan-Darden-Hudson model. The genealogy of a neu- 
tral locus i linked to the selected locus S can be traced back in time 
using the Kaplan, Darden, and Hudson [20,21] model, which 
provides a framework for modeling the coalescent process at a 
neutral locus that is linked to a locus under balancing selection. 
This model assumes that the selected locus maintains a balanced 
polymorphism that is infinitely old. Their framework involves 
modehng selection as a structured population containing two 
demes representing each of the two allelic classes and migration 
taking the role of recombination and mutation. Lineages within 
the first deme are linked to A\ alleles and lineages within the 
second deme are linked to A2 alleles. Lineages migrate between 
demes by changing their genomic background. That is, a lineage 
in the first deme will migrate to the second deme if there was a 
mutation that changed an A\ allele to an A2 allele or if there was a 
recombination event that transferred a lineage linked to an ^1 
aUele to an A2 background. Similarly, a lineage in the second 
deme will migrate to the first deme if there was a mutation that 
changed an A2 allele to an A\ aUele or if there was a 
recombination event that transferred a lineage linked to an A2 
aUele to an y4i background. The rate at which a lineage linked to 
an A\ background transfers to an A2 background is 
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Figure 1. Calculation of probabilities of polymorphism and substitution under a model of balancing selection and the 
incorporation of these probabilities into a genome scan. (A) Relationship among tree length L„(x,p), tree height H„(x,p) and inter-specific 
coalescence time C. (6) A site is polymorphic if a mutation occurred on the L„{x,p) length of branches until the most recent common ancestor of the 
ingroup sample (red region). (Q A site is a substitution if a mutation occurred on the 2C — H„(x,p) length of branches that represent the divergence 
between the outgroup species and the most recent common ancestor of the ingroup species (blue region). (D) Height and length of genealogies in 
relationship to their spatial proximity to a selected site and how the shapes of these genealogies affect the pattern of polymorphism around the site. 
The composite likelihood ratio is high near a selected site as there is an excess of polymorphisms close to the site and a deficit far from the site. 
doi:1 0.1 371 /journal.pgen.1 004561 .gOOl 



jS] = 01 + p,(l —x) and the rate at which a hneage linked to an 
background transfers to a.n A\ background is /?2 = (?2 + P/^- 

Consider a sample of n lineages with k lineages Linked to allele 
A\ [i.e., in the frrst deme) and n — k lineages linked to allele A2 
{i.e., in the second deme). Given this configuration, only four 
events are possible. The first event involves a coalescence of a pair 
of lineages linked to Ai alleles, the second involves a coalescence of 
a pair of lineages linked to A2 alleles, the third involves the transfer 
of a lineage from an Ai background to an A2 background, and the 
fourth involves the transfer of a lineage from an A2 background to 
an Ai background. The time until the first event {i.e., a 
coalescence or a transfer of background) is exponentially 
distributed with rate 



'kk.n-k{x,p)-- 



n-k\ 

2 ) kP2(l-x) {n-k)(iyx 



1- 



+ - 



1 



(2) 



The probability that the event is a coalescence of a pair of fl- 
unked lineages is 



X'kk.n-k(x,p) 

the event is a coalescence of a pair of /l2-linked lineages is 



(3) 



the event is a transfer from an A\ to an ^2 background : 



(1) , ^ kfi2(\-x) 

Xkk,n-k(X,P) 



(5) 



and the event is a transfer from an A2 to an y4i background is 

(n — k)P^x 



i^kl-kix^P)-- 



{l-x)Xk.n-kix,p)' 



(6) 



Note that in the notation of Kaplan el al. (1988) [20], 

h,n-k(x,p) = hk,„-k(x), 4',i-*r(^'P) = Ik- l.n-k(x), cfl_i^(x,p) = 

qk.n-k-i{x), m]^l_i^(.x,p) = qk-i_„-k+i(x), and rnf^„_i^(x,p) = 

qk+l,n-k-l(x). 

Let Lk,n-k(x,p) denote the expected tree length given a sample 
with k ^i-linked lineages and n — k y42-linked lineages. Using eq. 
18 of Kaplan et al. (1988) [20], the expected total tree length can 
be expressed using the recursion relation 



Lk,„-k(x,p)-- 



+ '^kl-k(x^P)Lk- l.n-k(x,p) 



h,n-k{x,p) 



+ "i'i _ 4 (^,P)i/t - 1 ,« - * + 1 {x,p) 



(7) 



+ '^'iT,n-k'y^^P)^k+\.«-k-\{x,p). 



C. 



-k 



Similarly, the expected tree height Hk,n-k(x,p) given a sample 



(2) / \_ V 2 / , , with k ^ I -linked lineages and n — k ^2 "linked lineages can be 



k,n — k 



('''P^=n t ^' 

{I —x)kk^„-k(x,p) expressed by 
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Hk,n-k{x,p)-- 



1 



h,n-k{x,p) 



+ cYn_k{x,p)Hk,n-k-\{x,p) 



+ 'nki-k(^'P)^k-l,n-k+l(x,p) 



+mf*„_k(x,p)Hk+i^-.k~i{x,p). 



with (m+ l)-dimensional main diagonal diag(M*"')= [1,1, ... ,1], 
«-dimensional lower diagonal 

lower(MW)= [-m<;>_i(x,p),-m<;>_2(^.P)' • ■ • ,-«l'i(x,p)], 
and «-dimensional upper 



diagonal 



(8) uppeTiM(''>) = [-m^^l(x,p),-mfl_i(x,p), 



'^n-l.\(x,P)\- 



AU elements that do not fall on the main, lower, and upper 
diagonals of M^"^ are zero. 

Given M*"\ b^"', and S"\ we can rewrite the recursion relation 
in eq. 7 as system of equations 



Solving the recursion relation 

Consider a sample of « lineages. Denote the (n+ l)-dimensional 
vector of tree lengths for a sample of size n as 



Lo,n(x,P) 

Li^-i(x,p) 

L2,n-2(X,P) 

L„,o{x,p) 



such that element fc, k = Q,\,...,n, of 6"^ is 6^^^ =Lk^-k(x,p). 
Next, define the iri + l)-dimensional vector 



b(«) 



h),n(x,P) 



" + f^i'L 1 (x,p)Lo,„-i(x,p) + cfl_^{x,p}Ll„-2{x,p) 



" +C2i-2(x,P)Ll,„-2(x,p) + cfl_2(x,p)L2,„-3{x,p) 



>^2,n-2(x,P) 



7 — 7 — T +c'-^i(x,p)L„-ifi(x,p) 



]y[(«)^n)_|,(n)^ 



(9) 



Because we can calculate eqs. 5 and 6, M<"' IS a constant matrix. 
For a sample of size n, suppose we know for a sample of size 
n—l. Therefore. Z"'"^'* is now a constant vector and hence, 
because we can calculate eqs. 2-4, b*"' is also a constant vector. 
Therefore, eq. 9 is a tridiagonal system of « + 1 equations with 
n+l unknowns, which can be solved in 0{n) time using the 
tridiagonal matrix algorithm [29]. 

The base case for the recursion in eq. 8 is when the number of 
lineages equals one. That is, when all lineages have coalesced and 
the most recent common ancestor is link(;(l (-ither to an ^ i allele or 
to an A2 allele. This base case can be represented by Lo,i(^>p) = 0 
and Lio{x,p) = 0. Given these values, set ^*'' = 
[Lq\(x,p),L]q(x,p)] = [0,0] and solve the system of equations 

]Vl(2)^(2)^j,(2) j.^^ ^(2)^ j^g^j^ gj^g^ £(2)^ jgjyg ^j^g jyjjgjj^ 

equations processes until 

]y[(«)^(«) ^J)*") is solved for An analogous process can be used 
to solve the recursion (eq. 8) for the expected tree height. 

Using the framework in this section for a sample of size n, we 
can obtain values for Lo^„{x,p),Li^„-i{x,p), . . . ,L„fi{x,p). Given 
that the Ai allele has frequency x and the A2 allele has frequency 
l—x, the expected tree length for a sample of size n is 



such that element 0 is 



k = 0 



Ln{x,p) = S ( ^ J x''{\ -xf-''Lk,„-k{x,p). 



(10) 



. («)_ _n_ (2). 



element n is 



b('') = 



+c%ix,p)t^:i^ 



" ^n,0(x,p) 

and element k, k=l,2, . . . ,n—l is 



bi"'= +<^i-k(x,P)(i:l'+<^-k(x,P)<^-''. 



h,n-k(x,p) 



Further, consider an («+ 1) x («+ l)-dimensional tridiagonal 
matrix of migration rates 



M("' = 



1 -KiM 0 0 

-mm_i(x,p) 1 -mfl_i(x,p) 0 



"2,»-2>,- 

0 



0 
0 

0 



Similarly, we can obtain the (-xpccted tree height H„{x,p) for a 
sample of size n. The tree heights and total branch lengths are then 
used in eq. 1 to compute the likelihood of the data under the 
selection model. 

A composite likelihood ratio test based on polymorphism 
and substitution. In this section, we illustrate how eq. 1 can he- 
incorporated into a composite likelihood. We wiU then describe a 
likelihood ratio test that compares the balancing selection model 
described above to a neutral model based on the background 
genome patterns of polymorphism. Consider a window of / sites 
that are either polymorphisms or substitutions and consider a 
putatively selected site S located within the window. Suppose site / 
within the window has n,- sampled alleles, a, observed ancestral 
alleles, and is a recombination distance of Pj from S. Let 
n= [«i,«2, ■ • • ,w/l, a = [01,02, • • • and p=[pi,P2, ■ ■ ■ ,Pj]. De- 



fine the indicator random variable 1 



{a, = k} 



that site / has k 



ancestral alleles. Using the Kaplan-Darden-Hudson model, the 
probability that site / is polymorphic is Pni,pi,x aid the probability 
that the site is a substitution (or fixed difierence) is 
Sn,,p„x = ^—Pn,,pi,x- Under the model, the composite likelihood 
that site S is under balancing selection is 
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£M(n,p,x ; a) = 



/ 

n 

i=i 



k} 



which is maximized at x - 



k=l 

argmax /i /_ \ 

= xs(0,i)^M(n,/», x; a). 



(11) 



Notice that 



sampling distribution for a site depends on the distance to the 
selected locus. In this method, as in previous composite likelihood 
methods for detecting selection, there is therefore no need for 
weighting sites depending on their distance from the selected sites. 
Such weighting is already incorporated in the probabilistic model. 
Similarly, there is no need for sliding windows, or the use of 
Hidden Markov Models (HMMs) to indicate the selected region. 
The likelihood ratio can, in principle, be calculated for any point 
in the genome, taking all other points in the genome into account. 
However, for practical computational reasons, we only calculate 
the likelihood ratio for a site using nearby sites in a fixed window 
of 100 substitutions or polymorphisms upstream and downstream 
of the focal site. As the distance from the selected site increases, 
litde is gained by incorporating information from more sites. 

Further, suppose that for a sample of size k, k = 2,?i, . . . ,n, 
conditioning only on sites that are polymorphisms or substitutions, 
the proportion of loci across the genome that are polymorphic is 
Pk and the proportion of loci that are substitutions is = 1 —pk- 
Then the composite likelihood that site S is evolving neutrally is 



£B(n; a)=n 

1=1 



"i - • 

=o}+p„i ^{"i 



-k] 



(12) 



It follows that the composite likelihood ratio test statistic that site 
S is under balancing selection is =2{ln[£M(n, /», x ; a)] — 
ln[£B(n;a)]}. 

A composite likelihood ratio test based on frequency 
spectra and substitutions. A balanced polymorphism not 
only increases the number of polymorphisms at linked neutrad 
sites, but also leads to an increase in minor allele frequencies at 
these sites. Therefore, power can be gained by using frequency 
spectra information in addition to information on the density of 
polymorphisms and substitutions. 

Given a sample of size n,a.nAi allele at frequency x, allele at 
frequency l—x, and a polymorphic neutral site that is p 
recombination units from a selected site, we can obtain the 
probability Pn,k,p,x that there are k, k=\,2, . . . ,n~\, ancestral 
alleles observed at the neutral site. The composite likelihood that 
site S is under balancing selection is 



£M(n, p, x; a)=n 



Snj,Pi,x^{ai = (l}+Pnj,Pi^ PniJc,Pi^^{ai=k} .J3) 



which is maximized at 50 = "^^^^^ Cyiij^, p, x ; a). 

Further, suppose that for a sample of size k, k = 2,3, ...,«, 
conditioning only on sites that are polymorphisms or substitutions, 
the proportion of polymorphic loci across the genome that have J, 
j=l,2, . . . ,k—\, ancestral alleles is piij. Then the composite 
likelihood that site S is evolving neutrally is 



'-i / 

£B(n ; a) = n [S„,. !{„,. =0} +^«,- ='t}] ■ (14) 

k=\ 

It follows that the composite likelihood ratio test statistic that site 
S is under balancing selection is 72 =2{ln[£M(n, />, i: ; a)] — 
ln[£B(n; a)]}. Because it is computationally difficult to derive 



analytical formulas for frequency spectra under the Hudson- 
Darden-Kaplan model, we approximate these distributions by 
simulating frequency spectra under the Hudson-Darden-Kaplan 
model for a range of equilibrium frequencies x and recombination 
parameters p. We then use a look-up table to identify the optimal 
spectrum to use, and if the optimum is intermediate between two 
spectra, the two closest distributions are employed. The two new 
methods, Ti and T2, have been implemented in the software 
package BALLET (BALancing selection LikElihood Test), which 
is written in C and is available at http://www.personal.psu.edu/ 
mxdGO/ software.html. 

Evaluating the methods using simulations 

To evaluate the performance of T\ and relative to HICA and 
Tajima's D, we carried out extensive simulations of balancing 
selection using different selection and demographic parameters. 
We simulated genomic data for a pair of species that diverged To 
years ago. We introduced a site that is under balancing selection at 
time Ts, and the mode of balancing selection at the sit(^ is 
overdominance with selection strength s and dominanct' param- 
eter h. In the simulations discussed in this article, we varied the 
demographic history in the target ingroup species, the strength of 
selection s, the dominance parameter h, and the time at which the 
selected allele arises tj. We consider two values for the strength of 
selection, .y=10^* and 10^^, five values for the dominance 
parameter, h = 100, 10, 3, 1 .5, and 1 . 1 25, and three times at which 
the selected allele arises, 15 = 10^, 5x10*, and 1.5x10^ years 
ago. Under the overdominance model considered here, the 
equilibrium frequency occurs at (h— l)/(2h— I) yielding equilib- 
rium frequencies of 0.50, 0.47, 0.40, 0.25, and 0.10 for A =100, 
10, 3, 1.5, and 1.125, respectively. These parameters were chosen 
to represent strong (s= 10^^) and substantially weaker {s= lO^'') 
selection coefficients and a range of equilibrium frequencies. In 
addition, the time Ts = 5xlO* years ago was meant to represent 
an ancient balanced polymorphism, whereas the other two values 
for Ts represent violations of assumptions of our methods. That is, 
the trans-species polymorphism occurring at Ts= 1.5 x 10^ years 
ago violates the assumption that lineages from the ingroup species 
are necessarily monophyletic, and the recent balanced polymor- 
phism arising 15= 10^ years ago represents balancing selection on 
an allele that is young relative to the average coalescence time for 
the ingroup species. Details of how the simulations were 
implemented are further described in the Materials and Methods 
section. 

Ancient balanced polymorphism. We performed simula- 
tions under each of the three demographic models depicted in 
Figure 2. For these simulations, we constructed receiver operator 
characteristic (ROC) curves to illustrate relationships between the 
true and false positive rates of each method. Figure 3 displays 
ROC curves for Ti, T2, HKA, and Tajima's D for simulations 
where .5=10^^ and A =100. Under a model of constant 
population size (left panel of Fig. 3), T2 tends to obtain more 
true positives than Ti, Ti more true positives than HKA, and 
HKA more true positives than Tajima's D (left panel of Fig. 3). In 
practice, however, we are typically concerned with a method's 
performance at low false positive rates. For a false positive rate of 
1%, Ti, T2, HKA, and Tajima's D have true positive rates of 30, 
40, 14, and 6%, respectively. Similarly, at a false positive rate of 
5%, Ti, T2, HKA, and Tajima's D have true positive rates of 58, 
67, 37, and 25%, respectively. These results show that Ti and T2 
each vastiy outperforms both HKA and Tajima's D, with T2 
performing better than Ti. However, these simulations were 
performed using the standard neutral model, which is also the 
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Ingroup Outgroup Ingroup Outgroup Ingroup Outgroup 

Figure 2. Demographic models used in simulations in which a selected allele arises after the split a pair of species. [A] Divergence 
model. Model parameters are a diploid effective population size N, divergence time to of the ingroup and outgroup species, and the time zs when 
the selected allele arises. (6) Divergence model with a recent bottleneck within the ingroup species. Additional model parameters are the diploid 
effective population size Ni, during the bottleneck, the time Xf, when the bottleneck began, and the time t,, when the bottleneck ended. (Q 
Divergence model with recent population growth within the ingroup species. Additional model parameters are the current diploid effective 
population size Nig after recent growth and the time tig when the growth occurred. 
doi:1 0.1 371 /journal.pgen.1 004561 .g002 



demographic model assumed in Ti and Ti. Thus, to examine the 
robustness of our methods, we next considered two complex 
demographic scenarios that could potentially afiect the results of 
our methods — a population bottleneck (Fig. 25) and a population 
expansion (Fig. 2C). 

The middle panel of Figure 3 displays ROC curves under a 
model in which the ingroup species experiences a recent severe 
bottleneck (Fig. 2-6). For a false positive rate of 1%, the true 
positive rates of Ti, Ti, HKA, and Tajima's D are 75, 74, 72, and 
5%, respectively. Similarly, for a false positive rate of 5%, the true 
positive rates of Ti, Tj, HKA, and Tajima's D are 80, 81, 80, and 
14%, respectively. Thus, aside from Tajima's Z), all methods 
perform well under this demographic model. This is because a 
severe population bottleneck decreases levels of diversity across the 
genome, resulting in a lower polymorphism-to-substitution ratio. 
Because T\, T2, and HKA all compare levels of polymorphism 
and divergence at a putatively selected site to those of the 
corresponding genomic background, these methods are able to 
identify the increased diversity at a site under balancing selection. 
In contrast, Tajima's D does not perform such a comparison and. 



thus, has little power to detect balancing selection under this 
demographic scenario. 

The right panel of Figure 3 depicts ROC curves under a 
demographic model in which the ingroup species experiences 
recent population growth (Fig. 2C). As with constant population 
size, T2 tends to obtain more true positives than Ti, T\ more true 
positives than HKA, and HKA more true positives than Tajima's 
D for a given false positive rate. At a false positive rate of 1%, T\, 
Tj, HKA, and Tajima's D have true positive rates of 39, 41, 15, 
and 10%, respectively, and at a false positive rate of 5%, T\, T2, 
HKA, and Tajima's D have true positive rates of 65, 69, 37, and 
32%, respectively. Interestingly, all four methods perform better 
under a recent population growth than under a constant 
population size. This result is potentially due to less fluctuation 
in the frequency of a selected allele in the recent past when the 
population size is large. 

By considering the demographic models in Figure 2, we have 
shown that Ti and T2 generally outperform both HKA and 
Tajima's D. Next, we investigated the effect of varying h (h= 100, 
10, 3, and 1.5) when ^=10^^ (Fig. SI). Under a model with 




1.0' ' ' — I ■ ' ^ ' ' 1 

0.00 0.01 0.02 0.03 0.04 0.05 0.00 0.01 0.02 0.03 0.04 0.05 0.00 0.01 0.02 0.03 0.04 0.05 

False positive rate False positive rate False positive rate 



Figure 3. Performance of Ti, T2, HKA, and Tajima's D under the demographic models in Figure 2 with selection parameter s= 10~^ 
and dominance parameter /i = 100. The first column is the divergence model in Figure 2A. The second column is the divergence model in 
Figure 26 with a recent bottleneck within the ingroup species. The third column is the divergence model in Figure 2C with recent population growth 
within the ingroup species. 
doi:1 0.1 371/journal.pgen.1 004561 .gOOB 
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Figure 4. Performance of T], Ti, HKA, and Tajima's D under the demographic models in Figure 2 with selection parameter s= 10 
and dominance parameter /) = 100. The first column is the divergence model in Figure 2A. The second column is the divergence model in 
Figure 26 with a recent bottleneck within the ingroup species. The third column is the divergence model in Figure 2C with recent population growth 
within the ingroup species. 
doi:1 0.1 371/journal.pgen.1 004561. g004 



constant population size (Fig. 2A), Tj outperforms T\, T\ 
outperforms HKA, and HKA outperforms Tajima's D. As h 
decreases, the performances of HKA and Tajima's D decrease, 
whereas the performances of T\ and Ti are not dramatically 
aflfected. Under a model with a recent population bottleneck 
(Fig. 2fi), T\, 72, and HKA all perform well, whereas Tajima's D 
performs poorly. In this scenario, h appears to have little influence 
on the relative performance of these methods. Finally, under a 
model with a recent population expansion (Fig. 2C), Ti outper- 
forms T\ outperforms HKA, and HKA outperforms Tajima's 
D. Decreasing h results in a decrease in the performance of 
Tajima's D, but has littie influence on the performances of all 
other methods. Moreover, the performances of T\ and Ti are 
similar for all h, whereas the perforances of HKA and Tajima's D 
are similar for large h (/;= 10 and 100), and dissimilar for low h 
(A = 1.5 and 3). 

For s= 10^^, T\ and T2 generally perform quite well (Figs. 3 
and SI). However, because T\ and T2 were developed to detect 
long-term balancing selection of infinite strength, it is unclear how 
the methods perform under weak selection. To investigate this 
scenario, we considered 5=10^'*, with /?>10 representing 
relatively strong balancing selection [i.e., relatively high hs) and 
/!<10 representing relatively weak balancing selection [i.e., 
relatively low hs). For /?= 100 (Fig. 4), we find that the relative 
performance of the four methods are similar to those in the case of 
strong selection {s= 10^^). Curiously, all methods perform better 
when s= lO^'' (Fig. 4) than when s= 10^^ (Fig. 3). To investigate 
the factors influencing this strange behavior, we plotted the mean 
difference in the number of polymorphic sites for a scenario with 
s=\Q^'^ and h=\QQ verses one with J=10^^ and /!=100 as 
function of the distance from the site under balancing selection 
(Fig. S2). We find that, on average, there are more polymorphic 
sites when the selection coefficient is weak, with the difference in 
numbers of polymorphic sites disappearing with increasing 
distance from the site under selection. This phenomenon is due 
to a drop in local effective population size near the site under 
balancing selection for the scenario with strong selection. Because 
h is so large [h = 100) and the population size is finite, heterozygous 
individuals leave a disproportionately large fraction of offspring in 
the next generation, therefore causing an apparent drop in local 
effective size near the site under selection. 



When .5=10^'' under a model of constant population size 
(Fig. 2/4), Tj outperforms T\, T\ outperforms HKA, and HKA 
outperforms Tajima's D when h is large {h = 10 and 100; Fig. S3), 
similar to what we observe when s= 10^^ (Fig. SI). In contrast to 
our observations when i = 10^^, all methods perform poorly when 
/; is small (A = 1.5 and 3), each identifying signatures of selection 
only slightly better than random (Fig. S3). Hence, when the 
selection coefficient is weak and the level of overdominance is low, 
T\ and T2 cannot extract enough information from the data to 
make meaningful predictions. However, HKA and Tajima's D 
perform just as poorly, and therefore T\ and T2 generally 
outperform HKA and Tajima's D under a demographic model 
with constant population size. 

Next, when .5=10^'' under a model with a recent population 
bottleneck (Fig. 2B), T\, T2, and HKA all perform well, whereas 
Tajima's D performs poorly (Fig. S3), similar to what we observe 
when s= 10^^ (Fig. SI). In contrast to the results for s= 10^^, h 
has some influence on the relative performance of these methods. 
As h decreases, the performance of all methods decreases — though 
not substantially. In addition, similarly to what we observe when 
s=lO^^, the performances of Ti, T2, and HKA are approxi- 
mately the same. Hence, even under weak selection coefficients, 
population bottlenecks tend to enhance the performance of T\, T2, 
and HKA, whereas they inhibit the performance of Tajima's D. 

Finally, when s= 10^* under a model with a recent population 
expansion (Fig. 2C), T2 outperforms T], T\ outperforms HKA, 
and HKA outperforms Tajima's D for large h (h=\Q and 100; Fig 
S3), as observed when s= 10^^ (Fig. SI). In contrast to the results 
for the case of .?= 10^^, all methods perform poorly when h is 
small (/2=1.5 and 3). Hence, like the case under constant 
population size, when the selection coefficient is weak and the 
level of overdominance is low, T\ and T2 cannot extract enough 
information from the data to make meaningful predictions. 
However, HKA and Tajima's D perform just as poorly, and 
therefore T\ and T2 generally outperform HKA and Tajima's D 
under a demographic model with recent population growth. 

So far the lowest dominance parameter considered here was 
A = 1.5, which has an equilibrium frequency of 0.25. To further 
assess the limits of our methods, we considered h=\ .125, which has 
a substantially smaller equilibrium frequency of 0.10. When 
s= 10^^, we find that all four methods perform poorly under the 
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constant population size (Fig. 2A) and growth (Fig. 2C) models (Fig. 
S4). In contrast, as with the higher equilibrium frequencies (Fig. SI), 
Ti, T2, and HKA statistics performed well, whereas Tajima's D 
performed poorly under the botdeneck (Fig. 2B) model (Fig. S4). 

We next examined violations in recombination rate assumptions 
of Ti and T2 by investigating the robustness of Ti and T2 to error 
in recombination rate estimation. For each simulation, we 
assumed a recombination rate of 2.5xl0~* per site per 
generation. We first wanted to investigate whether using an 
incorrect recombination map would increase the chances that T\ 
and T2 identify false positive. Figure S.5 depicts results under a 
model with constant population size (Fig. 2A) in which there is no 
selected allele. With respect to identifying false signals of balancing 
selection, our results indicate that Ti and T2 are robust to 
recombination rate underestimation and overestimation. We next 
wanted to examine whether using an incorrect recombination map 
would influence the power of Ti and T2 to identify ancient 
balanced polymorphisms. Figure S6 depicts results for a model 
with constant population size (Fig. 2/4) with time of selection 
Ts = 5xlO^ s=lO-^, large (h=lOO) and smaU (h=l.5) domi- 
nance parameters, and recombination rate overestimated by one 
or two orders of magnitude and underestimated by one or two 
orders of magnitude. We do not consider A =1.125 due to the 
poor performance of all methods considered here for that 
parameter setting. Incorrectly inferring an order of magnitude 
higher recombination rate slightiy improves the performance of 
both T] and T2. However, incorrectly inferring a two orders of 
magnitude higher recombination rate yields poor performance for 
both Ti and T2 under reasonable false positive rates (e.g., less than 
5%). Incorrectiy inferring the recombination rate by one or two 
orders of magnitude lower than the truth does not vastiy alter the 
power for Ti, but substantially decreases the power of T2. 

Ancient trans-species balanced polymorphism. One 
hallmark of balancing selection is that it maintains polymorphism 
for a long time, potentially for millions of years [8—10]. Thus, some 
balanced polymorphisms, referred to as trans-specific polymor- 
phisms, are shared across multiple species. Figure S7 displays the 
three demographic models that we consider in which a selected 
allele arises in the j)oj)ulation ancestral to the split of the ingroup 
and outgroup species. For each demographic scenario, we set 
15- = 1.5 X 10' years ago, creating a selected allele that is three 
times as ancient as the one that we consider in Figure 2. All other 
model parameters are identical to those considered in Figure 2. 

Figures S8 and S9 indicate that the performances of Ti, T2, 
HKA, and Tajima's D are not greatly affected by considering an 
ancient trans-species balanced polymorphism when compared to 
an ancient balanced polymorphism that occurred more recentiy 
than the split of a pair of species. This is important because the 
scenario of an ancient trans-species balanced polymorphism is a 
violation of the assumptions of the model since it forces lineages 
from the ingroup species to not be monophyletic with respect to 
the outgroup species. Henc:e, though Ti and T2 make the 
assumption that lineages from the ingroup species are monophy- 
letic, this assumption does not hinder the methods in practice. 

Young balanced polymorphism. The two methods devel- 
oped in this article assume that selection is infinitely strong and 
that the balanced polymorphism is infinitely old. Here we consider 
the performance of Ti, T2, HKA, and Tajima's D under a 
scenario in which a young balanced polymorphism arose = lO' 
years ago. Considering selection coefficients 5=10^^ (Fig. SIO) 
and S=10^* (Fig. Sll), all four methods performed poorly under 
the constant size and growth demographic scenarios, regardless of 
the dominance parameter. In contrast, Ti, T2, and HKA all 
perform well and Tajima's D performs poorly under the 



botdeneck scenario, similar to the results for the ancient balanced 
polymorphisms. These results show that the new methods have 
limited power to detect young balanced polymorphisms, 
except under a scenario in which the background density of 
polymorphisms is substantially lowered — as in the case of a strong 
r(x ent population botdeneck. 

Matching the mean density of polymorphisms to a 
constant size model. The alternate demographic scenarios 
that we investigated here have focused on the performance of Ti, 
T2, HKA, and Tajima's D for a recent population botdeneck or 
growth, relative to a constant size population. However, we have 
not considered whether a population botdeneck or growth actually 
changes the absolute performance of the methods, as these 
demographic events not only change the density of polymorphisms 
relative to constant size models, but they also change the shape of 
the frequency spectrum. To control for the density of polymor- 
phisms, we chose the ancestral effective size under the bottleneck 
and growth models so that the expected numl)cr of segregating 
sites under the botdeneck and growth models is the same as a 
constant size model of diploid effective size 10''. That is, we set the 
ancestral sizes for complex demographic models such that these 
complex models yield identical mean densities of polymorphic sites 
as a model of constant population size of iC diploid individuals. 
The details on how we chose these ancestral effective sizes can be 
found in the Materials and Methods section, with the ancestral 
diploid effective sizes under the botdeneck and growth models as 
14015 and 8762, respectively. 

Figures S12 and S13, Figures S14 and S15, and Figures S16 and 
SI 7 display results for times Zs at which a balanced polymorphism 
arose of 5x10*, 1.5x10', and 10^ years ago, respectively. 
Interestingly, these results indicate that the botdeneck and growth 
models behave similarly to a constant size model once the mean 
density of polymorphic sites is matched to that of a constant size 
model. That is, there no longer is a substantial improvement for 
Ti, T2, and HKA for botdeneck models relative to a constant size 
model. Hence, it is not the shape of the frequency spectrum that 
gave the apparent increase in power under the botdeneck model 
(e.g., compare Fig. 3 to Fig. 5 and Fig. 4 to Fig. 6). Rather, it was 
the large decrease in the background density of polymorphisms 
relative to that of the assumed effective population size under the 
model of balancing selection. In addition, when matching the 
mean density of polymorphisms, methods tended to perform better 
under the growth model than under the bottleneck model (e.g.. 
Figs. 5 and 6), counter to what was observed without matching the 
mean density of polymorphisms (e.g., compare Fig. 3 to Fig. 5 and 
Fig. 4 to Fig. 6). This observation is potentially due to the 
increased variance in coalescence times under the new bottleneck 
model compared to the new growth model, when the mean density 
of polymorphisms is matched to a constant size model. 

Empirical analysis 

Balancing selection in humans. We probed the effects of 
balancing selection in humans by using whole-genome sequencing 
data from nine unrelated individuals from the CEU population 
and nine unrelated individuals from the YRI population (see 
Materials and Methods). We performed a scan for balancing 
selection at each position in our dataset by considering a window 
of 100 substitutions or polymorphisms upstream and downstream 
of our focal site. This window size was taken for computational 
convenience, rather than by consideration of the recombination 
rate or polymorphism density within the region. Though we used a 
window size of 200 polymorphisms or substitutions for computa- 
tional convenience, Ti and T2 can also be computed using all sites 
on a chromosome. The mean window length was ~ 1 4. 7 kb for the 
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Figure 5. Performance of T\, T2, HKA, and Tajima's D under tlie bottlenecl< and growtKi demographic models in Figure 2 with 
selection parameter 5= 10^' and dominance parameter /i = 100. The left panel is the divergence model in Figure 26 with a recent bottleneck 
within the ingroup species. The right panel is the divergence model in Figure 2C with recent population growth within the ingroup species. The 
population sizes for the bottleneck and growth demographic histories have been scaled so that they produce the same number of segregating sites 
as a constant size population with diploid effective size A'=10'' individuals. 
doi:1 0.1 371 /journal.pgen.1 004561 .g005 
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Figure 6. Performance of Ti, T2, HKA, and Tajima's D under the bottleneck and growth demographic models in Figure 2 with 
selection parameter .5=10^'' and dominance parameter /i = 100. The left panel is the divergence model in Figure 26 with a recent bottleneck 
within the ingroup species. The right panel is the divergence model in Figure 2C with recent population growth within the ingroup species. The 
population sizes for the bottleneck and growth demographic histories have been scaled so that they produce the same number of segregating sites 
as a constant size population with diploid effective size A' = 10'' individuals. 
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Figure 7. Signals of balancing selection within the HLA region for the CEU (blue) and YRI (orange) populations using the r2 test 
statistic. From bottom to top, the horizontal dotted gray lines indicate the 0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. 
doi:1 0.1 371 /journal.pgen.1 004561 .g007 



CEU and ^13.7 kb for the YRI populations, which should be 
sufTiciently long because recombination quickly breaks down the 
signal of balancing selection at distant neutral sites. That is, under 
the Hudson-Darden-Kaplan model, the scale at which one would 
observe an increase in diversity is 1/ p=l/ (4Nr) = 
1/(4 X 10^^x2.5 X 10"^)= 1000 nucleotides, or a 1 kb window 
[21]. Manhattan plots for Ti (Figs. S18 and S19) and Tj (Figs. S20 
and S21) test statistics suggest that there are multiple oudier 
candidate regions. Intersecting the locations of these scores with 
those from the longest transcript of each RefSeq gene {i.e.^ 
transcription start to stop including exons and introns) led to 
identification of many previously-hypothesized and novel genes 
potentially undergoing balancing selection (see Tables S1-S4, with 
previously-hypothesized genes highlighted in bold). 

Multiple genes at the HLA region are strong outliers (top 0.01% 
of all scores across the genome) in our scan for balancing selection 
(Tables S1-S4). Because this study uses high-coverage sequencing 
data, resolution in the HLA region is particularly fine (Figs. S22 
and 7), with strong signals in classical MHC genes such as HLA-A ^ 
HLA-B, HLA-C, HLA-DR, HLA-DQ, and HLA-DP genes [14]. 
The HLA region, which is located on chromosome six, is a 



well-known site of balancing selection in humans [8-10]. The 
protein products encoded by HLA genes are involved in antigen 
presentation, thus playing important roles in immune system 
function. Genes at the HLA locus are known to be highly 
polymorphic and are thought to be subject to balancing selection 
due to frequency-dependent selection, overdominance, or fluctu- 
ating selection in a rapidly changing pathogenic environment 
[30,31]. As the HLA region is so well known as a locus under 
balancing selection, it is important that our methods identify 
strong candidate candidate genes in the regions as a proof of 
concept. 

One gene that we found particularly intriguing is FANKI (Figs. 
S23 and 8). This gene is one of the top four candidates in the CEU 
and YRI populations when using either the Ti or T2 statistic 
(Tables S1-S4). In addition, FANKI is the top candidate among 
genes that have not been previously hypothesized to be under 
balancing selection when using either test in the CEU and the Ti 
test in the YRI. FANKI is expressed during the transition from 
diploid to haploid state in meiosis [32,33]. Though it is often 
identified as spermatogenesis-specific [32,33], it is also expressed 
during oogenesis in cattle [34] and mice [35]. Its function is to 
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Figure 8. Signal of balancing selection at the FANK1 gene for the CEU (blue) and YRI (orange) populations using the T2 test statistic. 

From bottom to top, the horizontal dotted gray lines indicate the 0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. SNPs (rslDs) 
correspond to markers showing significant levels of transmission distortion within the IVIeyer ef a/, study [37], 
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suppress apoptosis [33], and it is one often to 20 genes identified 
as being imprinted in humans [i.e., allele specific methylation) 
[36]. Interestingly, it also shows marginal evidence of segregation 
distortion (Fig. 8) [37]. Further, as a CpG island resides directly 
underneath our signal in both the CEU and YRI populations, we 
analyzed the region around FANKl with all GC—>AT transitions 
on chromosome 10 removed as well as all transitions on 
chromosome 10 removed and we still retain the peak (Fig. S24), 
strongly suggesting that the signature of balancing selection that 
we identified around FANKl is not driven by CpG mutational 
effects. We were additionally surprised to find that the putative 
selection signal was approximately 40 kb wide, which is abnor- 
mally large for balancing selection. Looking back at the 
recombination map, we find that the rates in this region are 
extremely low, which explains the large width of the peak. 
However, Figures S5 and S6 indicate that erroneously inferring a 
lower recombination rate does not increase the power of detecting 
a selection signal, and can substantially impair the ability for T2 to 
detect a selection signal. 

More broadly, a glance at the top signals for the CEU (Tables 
SI and S3) and YRI (Tables S2 and S4) populations, reveals a 
substantial overlap in the candidate genes identified between the 
pair. If balancing selection has maintained a polymorphism for a 
long period of time, then we would expect these populations to 
share many signals in common due to their relatively recent 
population split. Tables S1-S4 indicate that our scan also 
identified a number of genes that were previously-hypothesized 
to be under balancing selection. However, the majority of this 
overlap is due to the HLA region. One candidate that we did not 
find support for was the ABO gene, which has been identified as a 
potential strong candidate using diverse complementary approach- 
es such as summary statistics [38] and trans-specific polymorphism 
information [7] . A number of factors, including the small sample 
size for each of the CEU and YRI populations used here and 
potential differences in the Complete Genomics dataset relative to 
others, could have caused the ABO gene to not be at the top of our 
list of candidates. 

Gene ontology analysis. To elucidate functional similari- 
ties among genes identified to be under balancing selection, we 



performed gene ontology (GO) enrichment analysis using 
GOrilla [39,40] . First, we compared an unranked list of the 
top 100 candidate genes (Tables S1-S4) to the background list 
of all unique genes. Genes obtained using either test statistic are 
enriched for processes involved in the immune response in both 
the CEU and YRI populations (Tables S5-S8). Similarly, the 
top genes are enriched for MHC class II functional categories 
(Tables S9-S1 1), with the exception of the Tj statistic applied to 
YRI, which has no functional enrichment. Further, these top 
genes tend to be components of the MHC complex and 
membranes (Tables S12-S15), which often directly interact with 
pathogens. Interestingly, removing all HLA genes from both the 
top 100 and background sets of genes reveals no GO 
enrichment for process, function, or component categories, 
indicating that enrichment is predominately driven by the HLA 
region. Because we can also provide a score for each candidate 
gene in our likelihood framework, we performed a second 
analysis in which we ranked genes by their likelihood ratio test 
statistic, with the goal of identifying GO categories that are 
enriched in top-ranked genes. Using this framework, the top 
candidate genes tend to be involved in immune response and 
cell adhesion processes (Tables S16-S19); MHC activity and 
membrane protein activity functions, such as transporting 
and binding molecules (Tables S20-S23); and MHC complex, 
membrane, and cell junction components (Tables S24-S27). In 
contrast to the case of the top 100 candidate genes, removing all 
HLA genes from the ranked list still resulted in GO enrichment 
in categories such as cell adhesion (processes), membrane 
protein activity (function), and components of membranes and 
cell junctions (component). 

Discussion 

In this article, we presented two likelihood-based methods, T\ 
and T2, to identify genomic sites under balancing selection. These 
methods combine intra-species polymorphism and inter-species 
divergence with the spatial distribution of polymorphisms and 
substitutions around a selected site. Through simulations, we 
showed that T\ and T2 vastly outperform both the HKA test and 
Tajima's D under a diverse set of demographic assumptions, such 
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as a population bottleneck and growth. In addition, application of 
T\ and to whole-genome sequencing data from Europeans and 
Africans revealed many previously identified and novel loci 
displaying signatures of balancing selection. 

Simulation results suggest that T2 performs at least as well as T\ , 
and so a natural question is whether T\ would ever be used. Based 
on the fact that T2 uses the allele frequency spectrum and T\ does 
not, then T\ wotjld be a valuable statistic to employ when allele 
frequencies cannot be estimated well. One example is a situation 
in which the sample size is small (e.g., one or two genomes). Under 
this scenario, the Tj test statistic would likely provide little 
additional power over the T\ statistic. As another example, it is 
becoming increasingly common for studies to sequence a pooled 
sample of individuals rather than each individual in the sample 
separately. This pooled sequencing wiU tend to yield inaccurate 
estimates of allele frequencies across the genome, which could 
heavily influence the performance of the Tj statistic. However, if 
there is sufiicient enough (;\ idence that a site has a pair of alleles 
observed in the sample, then this site can be considered 
polymorphic regardless of its actual allele frequency. Future 
developments that can statistically account for this uncertainty in 
allele frequency estimation could be incorporated into the T2 test 
statistic so that it can be applied to pooled sequencing data. In 
addition, our investigation into the robustness of T\ and T2 to 
errors in recombination rate estimates suggested that T\ tends to 
perform better than T2 when the estimate of the recombination 
rate is inaccurate. Because reliable genetic maps are unavailable 
for most organisms that have had their genome sequenced, T\ 
may be the preferable statistic for many current applications. 

The model of balancing selection used in this article is from 
Hudson and Kaplan [21], and assumes that natural selection is so 
strong that it maintains a constant allele frequency at the selected 
locus forever. The simulation scenarios considered here assumed 
that the strength of balancing selection was also constant since the 
selected allele arose. However, selection coefficients can fluctuate 
over time, which provides the basis for future work on 
investigating the robustness of methods for detecting balancing 
selection under scenarios in which the strength of selection 
fluctuates or when selection is weak. Future work can use the 
framework developed here to construct methods for identifying 
balancing selection under models with more relaxed assumptions 
(e.g., see Barton and Etheridge [41] and Barton et al. [42] for 
potential models). 

Recall that we chose a window size based on a fixed number of 
polymorphisms and substitutions. However, we could have chosen 
a window in a different way. For example, a window could have 
been chosen based on physical or genetic distance, rather than a 
fixed ntmiber of substitutions or polymorphisms. However, basing 
each likelihood calculation on a fixed number of substitutions or 
polymorphisms, rather than physical or genetic distance, enables 
each likelihood ratio to be based on the same number of terms, 
thereby letting the likelihood ratio depend on the density of 
polymorphisms vs. substitutions rather than the ntmiber of 
polymorphisms in the window. This contrasts other composite 
likelihood approaches for detecting positive selection (e.g., Nielsen 
et al., 200.^ [26]), where the likelihood under the selection model 
approaches the likelihood under neutrality with increasing 
distance from the site under selection. This characteristic exhibited 
by these other composite likelihood approaches permits variable- 
size windows, so that at some point adding new terms to the 
likelihood ratio will not change its value. However, for our 
method, the likelihood under selection does not approach the 
likelihood under the background level of diversity (neutrality) with 
increasing distance from the putative site under selection, causing 



the value of the likelihood ratio to change by modifying the 
number of terms. If we chose a standard neutral model for the nuU 
hypothesis, then the likelihood under selection would approach the 
likelihood under the null model with increasing distance from the 
selected site. To attempt to account for demographic history, we 
have instead chosen to use the genome-wide level of diversity for 
the nuU hypothesis, which does not require that the likelihood 
under selection to approach the likelihood under the nuU 
hypothesis with increasing distance from the putative balanced 
polymorphism. 

In our empirical analysis, we calculated the likelihood ratio (T^i 
or T2) for numerous positions along the genome. We then ranked 
genes according to the largest likelihood ratio estimated between 
the annotated transcription start and stop of the gene. A 
consequence of ranking genes in this manner is that longer genes 
are more likely to be significant. However, because ancient 
balancing selection only impacts a relatively small region of the 
genome (in contrast to recent positive selection), the signal of 
ancient balancing selection could be masked if we instead assigned 
the average likelihood ratio as the score for a large gene. We 
therefore opted to assign the score for a gene as the highest 
likelihood ratio calculated within that gene. 

Our methods have been shown to be substantially more 
powerful than HKA and Tajima's D at detecting ancient balanced 
polymorphisms. However, a glance at Figures 3 and 4 indicates 
that under constant size and growth models our methods have 
litfle power to detect balanced polymorphisms at low false positive 
rates — a range that would be necessary to detect ancient balancing 
selection if it were rare. Hence, if balancing selection is relatively 
rare, then relying solely on statistics considered here to identify 
ancient balanced polymorphisms could possibly lead to an 
overabundance of false positives. Complementary evidence, such 
as considering patterns of linkage disequilibrium or trans-specific 
polymorphisms in candidate regions, should also be employed to 
hone in on true signals of ancient balancing selection. 

Though we have shown that Ti and Tj perform well under a 
population bottieneck and growth, they may be less robust to other 
forms of demographic model violations, such as population 
structure. Because population subdivision increases the time to 
coalescence and corresponding length of a g('ncalog^', we expect 
higher levels of polymorphism across the genome. Under most 
assumptions, population subdivision affects the genome uniformly; 
it increases the level of background polymorphism and likely only 
shghtly decreases the power of the new statistics. However, in some 
cases, such as an ancient admixture event (e.g., with Neanderthals 
[43] or Denisovans [44]), levels of variability may increase in only 
a few regions of the genome, increasing the mean coalescence time 
in these regions. Such regions may appear to have excess 
polymorphism relative to background levels and, hence, display 
false signals of balancing selection under the T\ statistic. However, 
in non-African humans, introgressed regions typically have low 
population frequencies [43,44], and, hence, it would be unlikely 
for polymorphic sites in these regions to harbor many introgressed 
alleles segregating at intermediate frequencies. Thus, the T2 
statistic, which expUcitiy utilizes allele frequency spectra informa- 
tion, would likely be able to distinguish these blocks of archaic 
admixture from regions of balancing selection. Further, as 
observed in other studies of natural selection [45,46], increased 
robustness to confounding demographic processes can potentially 
be gained through the use of additional information. For example, 
population bottienecks as well as gene flow can increase linkage 
disequilibrium [47,48]. Therefore, knowledge about linkage 
disequilibrium in a region could aid in distinguishing population 
subdivision from long-term balancing selection. 
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Another concern when performing genomes scans for balancing 
selection is the possibility of false positives due to bioinformatical 
errors. For example, misalignment of sequence reads in duplicated 
regions may lead to falsely elevated levels of variability. In many 
cases, this problem can be alleviated by removing duplicated 
regions from analyses. However, a non-negligible portion of the 
human genome is not represented in standard reference sequences 
and, thus, there may be many unidentified paralogs in the 
genome. Fortunately, removing sites that deviate from Hardy- 
Weinberg equilibrium helps to alleviate these problems, because 
SNPs fixed between or segregating at high frequencies in one of 
two (or more) paralogous regions wUl have an excess of 
heterozygotes in combined short-read alignments. We applied a 
Hardy-Weinberg filter to all empirical data analyzed in this article. 
We note that deviations from Hardy-Weinberg equilibrium are 
expected under certain forms of balancing selection. In theory, a 
balancing selection signal could, therefore, be lost due to such 
filtering. However, we used a filtering cutoff of p<iO^* (see 
Materials and Methods). The strength of selection required to 
cause this type of deviation from Hardy-Weinberg equilibrium 
used in the filtering is extremely strong, and such selection would 
almost certainly have been detected using other methods. Well- 
established examples of balancing selection in the human genome, 
such as the selection affecting the HLA loci, are not lost because of 
filtering, and would generally not be easily detectable using 
deviations from Hardy-Weinberg as a test. Nonetheless, because 
phenomena other than balancing selection, such as bioinforma- 
tical errors or archaic admixture, could potentially lead to false 
signals of balancing selection, additional evidence should be 
obtained before definitively concluding that a site has been 
subjected to balancing selection. 

One source of additional evidence of balancing selection is 
whether a signal lies within a region harboring a trans-specific 
polymorphism [7,19] because it is unlikely to have a polymor- 
phism segregating in each of a pair of closely-related species 
without selection maintaining the polymorphism. However, 
relying solely on evidence from trans-specific polymorphisms 
would miss many true signals of balancing selection that are not 
maintained as trans-specific polymorphisms. In addition, regions 
with bioinformatical errors (e.g., mapping errors) may give the 
same errors in both species, resulting in a false signal of a shared 
polymorphism between the pair of species. Nevertheless, the 
observation of a trans-specific polymorphism can provide 
convincing evidence of an ancient balanced polymorphism 
[7,19]. Previous studies of selection have shown that combinations 
of statistics can be powerful tools when identifying genes under 
selection [1.^,18,49]. Hence, combining our methods with other 
summaries (e.g., linkage disequilibrium [45^8]) or information on 
trans-species polymorphisms [7,19] wUl lead to increasingly 
effective approaches for detecting balancing selection. 

The current approach taken by Ti and T2 ignores higher order 
linkage disequUibria, in the sense that it ignores linkage 
disequilibrium between pairs of neutral markers and only 
considers correlations between neutral markers and the site under 
selection. However, incorporating higher order linkage informa- 
tion, such as employing tests based on haplotypes, could provide 
some advantage. For example, Ti and T2 have little power to 
detect young balanced polymorphisms. However, the haplotype 
pattern around a young balanced polymorphism is likely to mimic 
that of an incomplete or partial selective sweep. Therefore, 
methods that use haplotype information (e.g., EHH [50], iHS 
[51], and hSl [52]), could provide a complementary and powerful 
approach to detecting recent balancing selection — a selective 
regime that the methods considered here have littie power. 



Another commonly-cited source of evidence for balancing 
selection is based on consideration of the topology and branch 
lengths of within-species haplotype trees. Under long-term 
balancing selection, the underlying genealogy (e.g., see Fig. S25) 
will be symmetric, with long basal branches separating a pair of 
allelic classes {i.e., haplotypes containing one variant and 
haplotypes containing the other variant). However, the underlying 
genealogy for a linked neutral variant may differ substantially from 
that of the selected site. Around a balanced polymorphism, there 
will be a strong reduction of linkage disequilibrium, not unlike a 
recombination hotspot, because the long genealogy in the 
balanced polymorphism provides extra opportunities for recom- 
bination. Consequently, the signal of balancing selection will be 
narrow, and trees estimated from sites located in a window around 
the balanced polymorphism may fail to detect the presence of 
highly divergent haplotypes. The utility of within-species haplo- 
type trees as a signature of long-term balancing selection is 
unclear, as the genealogy of the haplotype may not match the 
genealogy of the selected region. For example. Figure S26 shows 
that haplotype trees based on scenarios under balancing selection 
appear similar to those under neutrality, with the difference that 
external branches are slightly longer under balancing selection 
than under neutrality, which contrasts with the generally-held 
belief that basal branches should be long. These inferred long 
external branches are a product of estimating haplotype trees in 
recombining regions [53], which would likely be unavoidable in 
genomic regions under ancient balancing selection even if 
recombination events were undetected. As such, haplotype 
networks or trees built without exphcitiy accounting for recombi- 
nation may not be powerful tools for identifying regions under 
balancing selection. 

An assumption of the methods Ti and T2 introduced in this 
article is that two allelic classes at a selected site are maintained for 
an infinitely long period of time at a constant equilibrium 
frequ('ncy by balancing si4(;rti()n. However, balancing selection is 
not restricted to act only on two stable allelic classes, and the 
equilibrium frequency can fluctuate with time and space. 
Examples of balancing selection that do not conform to our 
model assumptions are frequency-dependent selection [2,3], 
fluctuating selection [2,4,5], selection maintained through segre- 
gation distortion [6], and selection maintaining more than two 
allelic classes [6]. Though these modes of balancing selection 
exhibit difierent evolutionary dynamics, they all lead to increased 
diversity around the site under selection, and therefore a decay in 
the density of polymorphisms with increasing genetic distance 
from the selected site. It is this information that Ti and T2 axe 
employing to identify signatures of balancing selection, and though 
the dynamics of these modes of balancing selection violate the 
assumptions of our methods, it is likely that the statistics developed 
here could identify genomic signatures left behind by these 
s(;l(x tive scenarios provided selection was strong enough. 

Within our scan, we identified a gene called FANKl, which is 
expressed during the transition from diploid to haploid states in 
meiosis [32,33], is often identified as spermatogenesis-specific 
[32,33], suppresses apoptosis [33], is imprinted [36], and exhibits 
evidence of segregation distortion (Fig. 8) [37]. These character- 
istics suggest that maintenance of polymorphism at FANKl results 
from segregation distortion, which can occur when the allele 
favored by distortion is associated with negative fitness effects, 
particularly if the negative effect is pronounced in the homozygous 
state (see p. 562-563 of Charlesworth and Charlesworth [6]; 
Ubeda and Haig [54]). The distorting allele will increase in 
frequency when rare because of the segregation distortion in 
heterozygotes. But when it becomes common, selection will act 
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against it because it will more often occur in the homozygous state 
when rare. Under such a scenario, theoretical results suggest that it 
is possible for a distorter to spread through a population without 
reaching fixation, obtaining a frequency that permits the 
maintenance of a stable polymorphism (see p. 564 of Charlesworth 
and Charlesworth [6]). In addition, the inclusion of imprinting at 
such a locus further enchances the parameter space at which a 
polymorphism can be maintained [54]. 

The function ofFANKl makes it a particularly good candidate 
for harboring alleles causing segregation distortion. It is expressed 
primarily during m(;io,sis and inhibits apoptosis, which has 
previously been hypothesized to be associated with segregation 
distortion [55,56]. A large proportion of sperm cells are eliminated 
by apoptosis, so alleKc variants causing avoidance of apoptosis 
after meiosis could serve as segregation distorters. However, 
mutations that lead to avoidance of apoptosis may be associated 
with negative fitness effects, especially in the homozygous states, 
because they could lead to dysspermia or azoospermia. Apoptosis 
during spermatogenesis plays a critical role in maintaining the 
optimal relationship between the number of developing sperm 
cells and Sertoli cells, which support developing sperm cells. 

Though some of the sites identified in FANKl show marginal 
levels of segregation distortion, the region displaying the largest 
level of segregation distortion in the human genome is located 
300 kb upstream of FANKl [37]. Further, a recent genome- wide 
association study for male fertility identified a significant SNP 
(rs9422913) located approximately 250 kb upstream ot FANKl 
[57]. Even though these regions are quite distant from FANKl, if 
strong enough linkage exists with FANKl , then it is possible for a 
two-locus segregation distorter to spread within a population (p. 
569 of Charlesworth and Charlesworth [6]). Hence the signals of 
segregation distortion [37] and fertility [57] displayed in these 
regions upstream of FANKl could be a result of an association 
with FANKl. 

Thus, FANKl is an interesting candidate for further study of 
balancing selection. The association of segregation distortion and 
balancing selection has been empirically described in other species, 
e.g., Caenorliabditis elegans [58]. However, as it has not yet been 
documented in humans, FANKl may be the first example of a 
segregation distorter causing balancing selection in humans. 
However, further experiments would be needed to test the 
hypothesis of segregation distortion in FANKl . 

In the last several years, there has been an accumulation of 
evidence against the pervasiveness of hard sweeps in some species, 
e.g., in humans [11-13]. Instead, other adaptive forces, such as 
balancing selection, could play an important role in shaping 
genetic variation across the genome. Interestingly, a recent 
theoretical study showed that a large proportion of adaptive 
mutations in diploids leads to heterozygote advantage [59], 
suggesting that much of the genome may be under balancing 
selection. If this intriguing prospect is true, then because our 
methods for detecting balancing selection are the most powerful 
that have been developed to date, they will be useful tools in 
uncovering the potc-ntially many regions under balancing selection 
in humans and other species. 

Materials and Methods 

Estimating inter-species expected coalescence times 

To compute the probabilities of polymorphism Pni,pi,x and 
substitution under our model, we must first obtain an 

estimate of the inter-species coalescent times C. For the purposes 
of our simulation and empirical analyses, we introduce a basic 
estimate (C) of the expected coalescence time between the ingroup 



and outgroup species. Consider a sample of n lineages («.«., n 
haploid individuals) from an ingroup species and one lineage from 
an outgroup species. For simplicity, assume that the ingroup 
species, outgroup species, and ancestral species from which the 
ingroup and outgroup diverged has an effective population size of 
A'^= IC* diploid individuals. Further, assume that the per-site per- 
generation mutation rate is ;U = 2.5xlO~'* and that the total 
sequence length analyzed is K. We estimate the expected 
coalescence time of all n lineages in the ingroup species as 
H = 'kl[ANnK{l — \ jri)\, where ft is the mean number of pairwise 
sequence differences and ANiiK{\ — l/n) is the expected number 
of mutations for a sequence of length K and n sampled lineages. 

Suppose that d is the number of substitutions of fixed differences 
observed between the ingroup and outgroup species. Then we 
estimate the mean coalescence time between the ingroup and 
outgroup species by C =[H + d/(2NfiK)]/2. 

Application of the new test statistics to data 

In the empirical analysis of human genomic data, we obtained 
values for the Ti and T2 test statistics for a large number of 
positions spaced across the genome. From these values, we 
overlapped protein coding regions (or genes including exons and 
introns) with the positions in the genome that the test statistics 
were calculated at. We assigned the value of the test statistic for the 
gene as the maximal value of the test statistic for the positions that 
it overlapped. We then ranked the set of genes based on their 
scores to identify genes that are outliers. Note that we are not 
attempting to identify regions with statistical significance or a 
certain p-value threshold, but instead are looking for genes that 
may be outiiers, and so the 0.01, 0.05, 0.10, and 0.50% empirical 
cutoffs are not meant to represent a formal significance cutolT. 

When applying the Ti and Tj test statistics to simulated and 
empirical data, we do not estimate the rate of mutation 61 from Ai 
alleles to A2 alleles or the rate of mutation 62 from A2 alleles to Ai 
alleles at the selected site S, as defined in the Hudson-Darden- 
Kaplan model. We instead treat these rates as a constant, with 
(^1 = 62=0.05 for the analyses in this article. The motivation is 
that, if these mutation rates did not exist, then the tree height 
would increase rapidly for small recombination rates. Our method 
assumes that a most recent common ancestor of the set of sampled 
alleles is reached more recentiy than the inter-species coalescence 
time C between the ingroup and outgroup species {i.e., H„(x,p) 
< C even for small p). Simulation results (see Evaluating the 
methods using simulations) show that our new methods perform 
extremely well, even though we set the nuisance d\ and O2 
parameters to a constant value. To maximize of the equilibrium 
frequency x of the A \ allele, we utilized the value of x, denoted by 
X, that maximized the composite likelihood under the model, by 
choosing x from values of 0.05,0.10, . . . ,0.95. 

Simulation procedure to evaluate the performance of Ti 

and T2 

We applied T\ and T2 to data simulated under population 
divergence models, using parameters to mimic humans (ingroup) 
and chimpanzees (outgroup). The models that we simulated under 
are illustrated in Figure; 2. For each of three models, we set each of 
the ingroup, outgroup, and ancestral j)opulati()n size's to A'^=10'' 
diploid individuals [60] and the divergence time between the 
ingroup and the outgroup species to T/) = 5x 10^ years ago [61]. 
We assumed a generation time of 20 years [62], a mutation 
rate of jU = 2.5xlO^^ mutations per-nucleotide per-generation 
[62], a recombination rate of r = 2.5xl0~^ recombinations 
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per-nucleotide per-generation, and a sequence length of 10^ 
nucleotides. Assuming a per-generation selection coefRcient s, 
where 0 < 5 < 1 , and a dominance parameter h, where h>l, at 
time Ts, a selected allele arose and evolved under an overdom- 
inance model with AiAi homozygotes having fitness 1, A\A2 
heterozygotes having fitness 1 -\-hs, and A2A2 homozygotes having 
fitness \+s. The formulation of this overdominance model is 
similar to that of [63] in which the fitness is A\A\ is 1, A1A2 is 
1 — hs, and A2A2 is 1 — .y. Under the Gillespie formulation, 
overdominance occurs when h<0, whereas it occurs when A > 1 in 
our formulation. However, both result in an equilibrium frequency 
ot{h — l)/(2h— 1). Simulations were performed using mpop [64], 
which was seeded with population-level chromosome data 
generated by the neutral coalescent simulator ms [65]. After the 
completion of each simulation, we sampled 1 8 chromosomes from 
the ingroup species and one chromosome from the outgroup 
species. For each set of parameter values, we simulated 10^ 
independent replicates. Ancestral alleles were called using the 
outgroup species, and so the called ancestral allele may not 
actually be the true ancestral allele. For each of the three 
demographic scenarios, we set Ts = T/) = 5 x 10'' years ago. For 
the botdeneck model (Fig. 2B), we set the botdeneck population size 
to Nh = 550 diploid individuals, the time at which the bottleneck 
began to T/, = 3.0x10'' years ago, and the time at which the 
botdeneck ended to = 2-2 xlO* years ago [66,67] . For the growth 
model (Fig. 2C), we set the expanded population size to 
Ng = 2 X 1 0'' diploid individuals and the time at which the 
population began to grow to T^ = 4.8xl0'* years ago [67]. 
Additionally, we considered a more ancient balanced polymor- 
phism arising = 1 .5 x 10^ years ago and a more recent balanced 
polymorphism arising Ts = 10^ years ago. Because the forward 
simulations in mpop arc computationally burdensome, we rescaled 
appropriate parameters by a factor of 10 such that the scaled 
population parameters remain the same, but the simulations are 
substantially sped up (by approximately a factor of 10^). Note that 
scaling parameters in this way can somewhat affect the time to 
fixation of select(-d alk-lcs. The distribution of false positive rates was 
generated by 1 0^ replicate neutral simulations from mpop, using the 
same parameters as the corresponding selection scenarios (including 
the rescaling factor) except without introducing a selected allele. 

Matching the density of polymorphic sites 

In the current set of simulations, the bottleneck and growth 
models each produce a different density of polymorphisms {i.e., 
number of segregating sites) than the constant size model. This 
section seeks to find an ancestral effective size for the growth and the 

bottleneck models such that the mean density of polymorphisms is 
close to that of the constant size model. We use eq. 1 in Marth et al. 
(2004) [68] to calculate the expected frequency spectrum under the 
botdeneck and growth models. The equation is 



number of epochs to two, we the expected frequency spectrum 
under the growth model as 
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Note that in our growth model, T\=Xg, N\=Ng, and N2=N. 
Denote the ratio of effective size during growth to the ancestral 
effective size as Cg = Ng/N. Then we can rewrite the equation as 
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Consider an an((-stral reference effective size 
{Ne = N =10,000 for the constant size model). Denote the 
expected number of segregating sites in a constant size model, 
conditional on effective size A'^, as ]E[5'*"(A'£,)]. Conditional on this 
ancestral reference effective size Ng, the expected site frequency 
spectrum under our growth model is 
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where Cg = 2 under our growth model. Therefore, the expected 
number of segregating sites conditional on reference effective size 
Ne is E[5G(iV,)] = YllZl Mp°A^<'y\- We obtain a growth model 
that produces the same density of polymorphic sites as our 
constant size model by choosing 
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where fi is the per-generation mutation rate, M is the number of 
epochs, Nm for w= 1,2, . . . ,M, is the effective population size for 

epoch m, and T„ for m=l,2, . . . ,M — 1, is the duration of time 
spent in epoch m. Our growth model contains two epochs, and so 
the appropriate version of the equation is when M = 2. Setting the 



Our botdeneck model contains three epochs, and so the 
appropriate version of the equation is when M = 3. Setting the 
number of epochs to three, we the expected frequency spectrum 
under the botdeneck model as 
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. (20) 



Note that in our bottleneck model, Ti = Tj, Ta = — Tg, 
Ni=N, N2=Nb, and N3=N. Denote the ratio of the effective 

size during the bottleneck to the ancestral effective size as 
Cb = Ni,/ N. Then we can rewrite the equation as 



4ftN 



E 



n-k 
1 



n 



(21) 



Conditional on this reference effective size, the expected site 
frequency spectrum under our bottleneck model is 



EK,,W)1 = 



E- 



(22) 



where c;, = 0.055 under our bottleneck model. Therefore, the 

expected number of segregating sites conditional on reference 

effective size Ne is ElS^(Ne)] = ^\Pn,i(Ne)]- We obtain a 

botdeneck model that produces the same density of polymorphic 
sites as our constant size model by choosing 



are mm 

Nf= ^^|E[5B(iV,)]-E[5C(Af)]| 
NeeZ^ 



arg mm 



N,eZ 
= 14015. 



+ |E[5"(iV,)]-E[5^(10l| 



(23) 



Empirical dataset construction 

We used data from nine European and nine African diploid 
genomes sequenced by Complete Genomics [69]. All individuals 
were unrelated [70], with the European individuals from the 
CEU population (NA06985, NA06994, NA07357, NA10851, 

NA12004, NA12889, NA12890, NA12891, NA12892) and the 
African individuals from the YRI population (NA 18501, 
NA18502, NA18504, NA18505, NA18508, NA18517, 
NA19129, NA19238, NA19329). We used die genotype calls 
made by Complete Genomics that were found in the "master- 
VarBeta" files. We downloaded pairwise alignments between 
human reference hgl8 and chimpanzee reference panTro2 from 
the UCSC Genome Browser at http://genome.ucsc.edu/. Sites 
with more than two distinct alleles across all Complete Genomics 
individuals as well as the hgl8-panTro2 alignments, sites in the 
Complete Genomics data where one of the two alleles did not 
match the reference sequence, and sites that were within two 
nucleotides of structural variants called in any one of the 



Complete Genomics individuals were remo\(;d. In addition, 
combining all 54 unrelated individuals in the Complete 
Genomics dataset, sites that had a ^-value less than 10""* for a 
one-tailed Hardy- Weinberg test of excess heterozygotes [71] were 
excluded. We used the full set of 54 unrelated individuals, 
totalling 108 alleles, so that we would have sufficient power to 
detect Hardy-Weinberg departures due to excess heterozygotes. 
Sites flagged as departing from Hardy-Weinberg proportions in 
this set of 54 individuals were then filtered out in the smaller 
subsets of nine CEU and nine YRI individuals. It should be 
noted that under a scenario of heterozygote advantage, it is 
expected that we should observe an excess of heterozygous 
individuals at sites in the vicinity of the site under balancing 
selection. However, a major concern with sequencing data are 
mapping errors, and so the Hardy-Weinberg filter is necessary to 
reduce the confounding effects of regions with these bioinforma- 
tical artifacts. As a consequence, this filter may increase the 
chance that we miss certain regions that are under balancing 
selection in our scan. Finally, sites that were polymorphic in the 
Complete Genomics sample {i.e., either CEU or YRI) and sites 
that contained a fixed difference between the Complete 
Genomics sample and the chimpanzee reference sequence were 
retained. As in the simulations, the ancestral allele was called 
using the chimpanzee outgroup, and so the called ancestral allele 
may not be the true ancestral allele. However, simulation results 
shows that our new methods perform well even when the 
ancestral allele is potentially misspecified. Further, it may be 
possible to account for ancestral allele misspecification by using 
multiple outgroups, or by statistically accounting for the 
misspecification [72]. 

To obtain recombination rates between pairs of sites, we used 
the sex-averaged pedigree-based human recombination map from 
deCODE Genetics [73]. We constructed recombination rates 
between all pairs of sites in the filtered Complete Genomics 
samples by linearly interpolating rates between adjacent sites 
within the sex-averaged maps. 

Supporting Information 

Figure SI Performance of Ti, Tj, HKA, and Tajima's D under 
the demographic models in Figure 2 with selection parameter 
^=10^^ and dominance parameter h. Each row represents a 
different h value. The first column is the divergence model in 
Figure 2/4. The second column is the divergence model in 
Figure 2B with a recent botdeneck within the ingroup species. 
The third column is the divergence model in Figure 2C with 
recent population growth within the ingroup species. 
(PDF) 

Figure S2 Mean difference in the number of polymorphic sites 
for a model with .? = lO^'* versus one with .? = 10^^ as a function of 
the distance from the site under balancing selection. Simulations 
were performed under the constant size divergence model in 
Figure 2A with selection parameter s, dominance parameter 
A =100, and time of selection T5 = 5x 10^ years ago. The mean 
difference in polymorphic sites is calculated for bins of size one 
kHobase and is plotted for 50 bins. 
(PDF) 

Figure S3 Performance of T\ , T2, HKA, and Tajima's D under 
the demographic models in Figure 2 with selection parameter 
s=\Q~^ and dominance parameter h. Each row represents a 
different h value. The first column is the divergence model in 
Figure 2/4. The second column is the divergence model in 
Figure 2B with a recent botdeneck within the ingroup species. 
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The third column is the divergence model in Figure 2C with 

recent population growth within the ingroup species. 

(PDF) 

Figure S4 Performance of T\, T2, HKA, and Tajima's D under 

the demographic models in Figure 2 with selection parameter 
S=10^^ and dominance parameter /; = 1.125. The first panel is 
the divx-rgence model in Figure 2A. The second panel is the 
divergence model in Figure 2B with a recent bottleneck within the 
ingroup species. The third panel is the divergence model in 
Figure 2C with recent population growth within the ingroup 
species. 
(PDF) 

Figure S5 Performance of Ti and T2 under the constant size 
divergence model in Figure 2A with no selected allele (neutrality). 
The first and second panels are scenarios in which we have 
erroneously over-estimated the recombination rate by two and one 
orders of magnitude, respectively {i.e., we respectively assumed 
recombination rates of 2.5 x 10^* and 2.5 x 10^^ per base per 
generation when the simulations were performed using a rate of 
2.5 X 10^^ per base per generation). The third and fourth panels 
are scenarios in which we have erroneously under-estimated the 
recombination rate by one and two orders of magnitude, 
respectively {i.e., we respectively assumed recombination rates of 
2.5 X 10~' and 2.5x10"'" per base per generation when the 
simulations were performed using a rate of 2.5 x 10^^ per base 
per generation). False positive rate is determined by neutral 
simulations under a model with recombination rate of 2.5 X 10~* 
per base per generation. 
(PDF) 

Figure S6 Performance of Ti and T2 under the constant size 
divergence model in Figure 2A with selection parameter .s= 10^^, 
dominance parameter h=WO or 1.5, and time of selection 
= 5 X 10* years ago. The first and second columns are scenarios 
in which we have erroneously over-estimated the recombination 
rate by two and one orders of magnitude, respectively {i.e., we 
respectively assumed recombination rates of 2. 5x10^* and 
2.5 X 10^' per base per generation when the simulations were 
performed using a rate of 2.5 X 10~* per base per generation). 
The third and fourth columns are scenarios in which we have 
erroneously under-estimated the recombination rate by one and 
two orders of magnitude, respectively {i.e., we respectively 
assumed recombination rates of 2.5 x 10^' and 2.5 x 10^'" per 
base per generation when the simulations were performed using a 
rate of2.5x 10^* per base per generation). False positive rate is 
determined by neutral simulations under a model with recombi- 
nation rate of 2.5 x 10~* per base per generation. 
(PDF) 

Figure S7 Demographic models used in simulations in which a 
selected allele arises prior to the split a pair of species. {A) 
Divergence model. Model parameters are a diploid effective 
population size N, divergence time To of the ingroup and 
outgroup species, and the time Zs when the selected allele arises. 
(5) Divergence model with a recent bottleneck within the ingroup 
species. Additional model parameters are the diploid effective 
population size N/, during the bottleneck, the time T/, when the 
bottleneck began, and the time Te when the bottleneck ended. (C) 
Divergence model with recent population growth within the 
ingroup species. Additional model parameters are the current 
diploid effective population size Ng after recent growth and the 
time Tg when the growth occurred. 
(PDF) 
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Figure S8 Performance of Ti, T2, HKA, and Tajima's D under 
the demographic models in Figure S7 with selection parameter 
s=lO~^ and dominance parameter h. Each row represents a 
different h value. The first column is the divergence model in 
Figure 87^4. The second column is the divergence model in Figure 
S7B with a recent bottieneck within the ingroup species. The third 
column is the divergence model in Figure S7C with recent 
population growth within the ingroup species. 
(PDF) 

Figure S9 Performance of T\, T2, HKA, and Tajima's D under 
the demographic models in Figure S7 with selection parameter 
.S'=10^'* and dominance parameter h. Each row represents a 
different h value. The first column is the divergence model in 
Figure S7^. The second column is the divergence model in Figure 
S7B with a recent bottleneck within the ingroup species. The third 
column is the divergence model in Figure S7C with recent 
population growth within the ingroup species. 
(PDF) 

Figure SIO Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 
parameter s= 10^^, dominance parameter h, and time of selection 
T5 = 10'. The first column is the divergence model in Figure 2A. 
The second column is the divergence model in Figure 2B with a 
recent bottleneck within the ingroup species. The third column is 
the divergence model in Figure 2C with recent population growth 
within the ingroup species. 
(PDF) 

Figure Sll Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 
parameter s= 10^*, dominance parameter h, and time of selection 
T5= 10'. The first column is the divergence model in Figure 2A. 
The second column is the divergence model in Figure 2B with a 
recent bottieneck within the ingroup species. The third column is 
the divergence model in Figure 2C with recent population growth 
within the ingroup species. 
(PDF) 

Figure S12 Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 
parameter 5=10^^ and dominance parameter h. Each row 
represents a different h value. The population sizes for these 
demographic histories have been scaled so that they produce the 
same number of segregating sites as a constant size population 
with diploid efiFective size ^^=10^* individuals. The first column is 
the divergence model in Figure 2B with a recent bottieneck within 
the ingroup species. The second column is the divergence model in 
Figure 2C with recent population growth within the ingroup 
species. 
(PDF) 

Figure S13 Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 
parameter 5 = 10^^* and dominance parameter h. Each row 
represents a different h value. The population sizes for these 
demographic histories have been scaled so that they produce the 
same number of segregating sites as a constant size population 
with diploid effective size N= lO'' individuals. The first column is 
the divergence model in Figure 2B with a recent bottieneck within 
the ingroup species. The second column is the divergence model in 
Figure 2C with recent population growth within the ingroup 
species. 
(PDF) 
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Figure S14 Performance of T\, Tj, HKA, and Tajima's D 
under the demographic models in Figure S7 with selection 
parameter s=\Q~^ and dominance parameter h. Each row 
represents a different h value. The population sizes for these 
demographic histories have been scaled so that they produce the 
same number of segregating sites as a constant size population 
with diploid effective size JV= Iff* individuals. The first column is 
the divergence model in Figure S75 with a recent bottleneck 
within the ingroup species. The second column is the divergence 
model in Figure S7C with recent population growth within the 
ingroup species. 
(PDF) 

Figure S15 Performance of T\, T2, HKA, and Tajima's D 
under the demographic models in Figure S7 with selection 
parameter 5=10^^* and dominance parameter h. Each row 
represents a different h value. The population sizes for these 
demographic histories have been scaled so that they produce the 
same number of segregating sites as a constant size population 
with diploid effective size N=W^ individuals. The first column is 
the divergence model in Figure S7fi with a recent bottleneck 
within the ingroup species. The second column is the divergence 
model in Figure S7C with recent population growth within the 
ingroup species. 
(PDF) 

Figure S16 Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 
parameter s=lO~^, and dominance parameter h, and time of 
selection Ts = 10^. Each row represents a different h value. The 
population sizes for these demographic histories have been scaled 
so that they produce the same number of segregating sites as a 
constant size population with diploid effective size N=10* 
individuals. The first column is the divergence model in Figure 2B 
with a recent botdene[:k within the ingroup species. The second 
column is the divergence model in Figure 2C with recent 
population growth within the ingroup species. 
(PDF) 

Figure S17 Performance of Ti, T2, HKA, and Tajima's D 
under the demographic models in Figure 2 with selection 

parameter .s=10^^, and dominance parameter h, and time of 
selection 15= 10'. Each row represents a different h value. The 
population sizes for these demographic histories ha\ e been scaled 
so that they produce the same number of segregating sites as a 
constant size population with diploid effective size A'^=10'* 
individuals. The first column is the divergence model in Figure 2B 
with a recent botdeneck within the ingroup species. The second 
column is the divergence model in Figure 2C with recent 
population growth within the ingroup species. 
(PDF) 

Figure S18 Manhattan plot of genome-wide scans for balancing 
selection within the GEU population using the T\ test statistic. 
From bottom to top, the horizontal dotted gray lines indicate the 
0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. 
The j-axis is truncated at log composite likelihood ratio of zero. 
(PDI^ 

Figure S19 Manhattan plot of genome- wide scans for balancing 
selection within the YRI population using the Ti test statistic. 
From bottom to top, the horizontal dotted gray Unes indicate the 
0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. 
The j'-axis is truncated at log composite likelihood ratio of zero. 
(PDF) 



Figure S20 Manhattan plot of genome-wide scans for balancing 
selection within the GEU population using the T2 test statistic. 
From bottom to top, the horizontal dotted gray lines indicate the 
0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. 
The y-axis is truncated at log composite likelihood ratio of zero. 
(PDF^ 

Figure S21 Manhattan plot of genome-wide scans for balancing 
selection within the YRI population using the T2 test statistic. 
From bottom to top, the horizontal dotted gray hnes indicate the 
0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, respectively. 
The j-axis is truncated at log composite likelihood ratio of zero. 
(PDF^ 

Figure S22 Signals of balancing selection within the HIA region 
for the CEU (blue) and YRI (orange) populations using the Ti test 
statistic. From bottom to top, the horizontal dotted gray lines 
indicate the 0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, 
respectively. 
(PDF) 

Figure S23 Signal of balancing selection at the FANKl gene for 
the CEU (blue) and YRI (orange) populations using the Ti test 

statistic. From bottom to top, the horizontal dotted gray lines 
indicate the 0.5%, 0.1%, 0.05%, and 0.01% empirical cutoffs, 
respectively. SNPs (rsIDs) correspond to markers showing 
significant levels of transmission distortion within the Meyer et al. 
study. 
(PDF) 

Figure S24 Signal of balancing selection at the FANKl gene for 
the CEU (blue) and YRI (orange) populations when removing 

either GC~*AT transitions or all transitions. SNPs (rsIDs) 
correspond to markers showing significant levels of transmission 
distortion within the Meyer et al. study. 
(PDF) 

Figure S25 Genealogy at the site of balancing selection. 
(PDF) 

Figure S26 Haplotype trees based on randomly sampling 18 
haplotypes without replacement from a random simulation under 
the model in Figure S7A. Trees were generated using UPGMA 
applied to a distance matrix of the proportion of nucleotide 
differences between each pair of haplotypes. The x-kilobase (kb) 
window represents a region that is x kb in length and is centered in 
the middle of the haplotype. 
(PDF) 

Table SI Top 100 signals in the GEU population using the Ti 

test statistic. 

(PDF) 

Table S2 Top 100 signals in the YRI population using the Ti 

test statistic. 

(PDF) 

Table S3 Top 100 signals in the GEU population using the T2 

test statistic. 

(PDF) 

Table S4 Top 100 signals in the YRI population using the T2 

test statistic. 

(PDF) 

Table S5 GO process analysis of top 100 signals, when 
compared to all signals, from GEU population using the Ti test 
statistic. 
(PDF) 
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Table S6 GO process analysis of top 100 signals, when compared 
to all signals, from YRI population using the T\ test statistic. 
(PDF) 

Table S7 GO process analysis of top 100 signals, when compared 
to all signals, from CEU population using the T2 test statistic. 
(PDF) 

Table S8 GO process analysis of top 100 signals, when compared 
to all signals, from YRI population using the T2 test statistic. 
(PDF) 

Table S9 GO function analysis of top 100 signals, when 
compared to all signals, from CEU population using the T\ test 
statistic. 
(PDF) 

Table SIO GO function analysis of top 100 signals, when 
compared to all signals, from YRI poptdation using the T\ test 
statistic. 
(PDF) 

Table Sll GO function analysis of top 100 signals, when 
compared to all signals, from CEU population using the T2 test 
statistic. 
(PDF) 

Table SI 2 GO component analysis of top 100 signals, when 
compared to all signals, from CEU population using the T\ test 
statistic. 
(PDF) 

Table S13 GO component analysis of top 100 signals, when 
compared to all signals, from YRI population using the T\ test 
statistic. 
(PDF) 

Table S14 GO component analysis of top 100 signals, when 
compared to all signals, from CEU poptilation using the T2 test 
statistic. 

(PDF) 

Table S15 GO component analysis of top 100 signals, when 
compared to all signals, from YRI population using the T2 test 
statistic. 
(PDF) 

Table S16 GO process analysis of ranked signals from CEU 

population using the T\ test statistic. 

(PDF) 

Table S17 GO process analysis of ranked signals from YRI 

population using the T\ test statistic. 

(PDF) 



Table S18 GO process analysis of ranked signals from CEU 

population using the T2 test statistic. 

(PDF) 

Table S19 GO process analysis of ranked signals from YRI 

population using the T2 test statistic. 

(PDF) 

Table S20 GO function analysis of ranked signals from CEU 

population using the T\ test statistic. 

(PDF) 

Table S21 GO function analysis of ranked signals from YRI 

population using the T\ test statistic. 

(PDF) 

Table S22 GO fimction analysis of ranked signals from CEU 
population using the T2 test statistic. 

(PDF) 

Table S23 GO function analysis of ranked signals from YRI 
population using the T2 test statistic. 

(PDF) 

Table S24 GO component analysis of ranked signals from CEU 
population using the T\ test statistic. 

(PDF) 

Table S25 GO component analysis of ranked signals from YRI 
population using the T\ test statistic. 

(PDF) 

Table S26 GO component analysis of ranked signals from CEU 
population using the T2 test statistic. 

(PDF) 

Table S27 GO component analysis of ranked signals from YRI 

population using the T2 test statistic. 

(PDF) 
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